Fast-MFQE: A Fast Approach for Multi-Frame Quality Enhancement on Compressed Video

Quality enhancement is essential for compressed images and videos. Although deep learning has achieved remarkable results on this task, most deep models are too large for real-time use. To meet the requirements of real-time video-quality enhancement, a fast multi-frame quality enhancement method for compressed video, named Fast-MFQE, is proposed. The method has three main modules. The first is the image pre-processing building module (IPPB), which reduces redundant information in the input images. The second is the spatio-temporal fusion attention (STFA) module, which merges the temporal and spatial information of the input video frames. The third is the feature reconstruction network (FRN), which reconstructs and enhances the spatio-temporal information. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in terms of parameter count, inference speed, and quality enhancement performance. Even at a resolution of 1080p, Fast-MFQE achieves an inference speed of over 25 frames per second while providing a PSNR increase of 19.6% on average when QP = 37.


Introduction
Nowadays, an abundance of ultra-high-definition (UHD) video is available for online viewing, which places substantial strain on communication bandwidth. To transmit video within limited network bandwidth, compression is vital for reducing the bit rate. However, highly efficient video coding standards, such as H.264/AVC [1] and H.265/HEVC [2], introduce artifacts through their de-correlation and predictive coding techniques, degrading the quality of the video to some extent [3]. As illustrated in Figure 1, a video reconstructed after low-bandwidth transmission is of low quality, with obvious artifacts such as blurring, ringing, blocking, and motion distortion. When such videos are used for subsequent visual tasks, such as object recognition, object detection, and object tracking, the low quality degrades performance dramatically [4,5]. Therefore, quality enhancement for compressed video is crucial for video applications and has emerged as an important area of research.
For image or single-frame quality enhancement, traditional methods [6][7][8][9][10][11] aimed to enhance the quality of compressed JPEG images by optimizing the transform coefficients of a specific compression standard. Specifically, refs. [8,9] proposed Shape-Adaptive DCT (SA-DCT) and Regression Tree Fields (RTF), respectively, to reduce JPEG blocking artifacts. Nevertheless, these methods are difficult to apply to other compression tasks because of their limited generalization ability. With the advance of deep learning, a growing range of methods have adopted convolutional neural networks (CNNs) [12][13][14][15][16] to improve compressed image quality. In [12], a four-layer AR-CNN was first introduced to deal with various artifacts in JPEG images. Building on this, Zhang et al. [13] proposed the deeper DnCNN for several image restoration tasks. Then, based on residual non-local attention, a method named RNAN [17] was proposed to remove image noise. Subsequent methods used recursive units and gate units to remove JPEG artifacts [18], as well as a dual-stream multi-path recursive residual network [19]. Later on, Lin et al. [20] proposed a multiscale image fusion approach that removes JPEG artifacts effectively and achieves exceptional objective quality. However, these methods cannot be extended to compressed video directly, since they treat frames independently and thus fail to exploit temporal information.
To enhance the quality of compressed video, a 10-layer CNN automatic decoder (DCAD) [21] was the first work to mitigate distortion in compressed videos. In [22], the two subnetworks of DS-CNN were introduced to address both intra-frame and inter-frame artifacts. Their main purpose was to enhance the target frame by leveraging the spatial correlation between video frames. Many multi-frame compressed video enhancement methods have also been proposed [23][24][25][26]. Yang et al. [23] introduced a multi-frame quality enhancement network, named MFQE1.0, which leveraged adjacent high-quality frames to enhance the target frame. MFQE2.0 [24] was an improved version. QG-ConvLSTM [25] then used bidirectional recurrent convolution to capture long-range temporal information. Based on deformable convolution (DCN), Deng et al. [26] introduced spatio-temporal deformable convolution (STDF), which extracts temporal information from multiple frames effectively by expanding the input to 7 or even 9 frames. These methods primarily aimed to enhance the target frame by leveraging the temporal relationships among multiple video frames. In general, multi-frame compressed video enhancement achieves better results than single-frame enhancement because it uses richer temporal and spatial information. However, these methods for compressed video enhancement face the following challenges:
(1) The networks have an excessively large number of parameters, which hampers efficient training and real-time use.
(2) Existing methods tend to prioritize enhancement quality at the expense of inference speed.
Therefore, it is necessary to explore lightweight and high-performance models for compressed video quality enhancement.
A "lightweight model" is one whose size has been reduced to maximize computational speed while preserving accuracy. Researchers have paid increasing attention to lightweight models in image classification to enable deployment on mobile devices [14][15][16][27,28]. Among the pioneering efforts, SqueezeNet [29] replaced 3 × 3 convolutions with 1 × 1 convolutions, reducing the parameter count to approximately one-fiftieth of that of AlexNet [30]. Subsequently, Xception [31] further reduced the parameters by decoupling the Inception structure [32]. ResNeXt [33] introduced group convolutions and reduced the parameters by effectively integrating the residual network and the Inception structure [32]. In 2017, Google introduced MobileNet [34], which is built on depthwise separable convolution (DSC) to effectively reduce the parameters in neural networks. Subsequently, MobileNet V2 [35] surpassed the previous performance benchmarks by implementing inverted residual structures and linear bottleneck layers. After that, ShuffleNet [36,37]
To enhance the quality of compressed video and achieve superior inference performance, an end-to-end CNN-based method for the video quality enhancement (VQE) task, named Fast-MFQE, is proposed. The main contributions of the method are as follows: (1) A novel IPPB module is designed to reduce multi-frame information redundancy and accelerate inference; (2) STFA and FRN modules are proposed to effectively extract the temporal features and the multi-frame correlation.
More intuitively, the parameters and the performance of diverse VQE methods are shown in Figure 2. It can be seen that, compared to state-of-the-art VQE methods, the proposed Fast-MFQE has fewer parameters and faster inference while also improving the quality of the compressed video, as measured by ∆PSNR and ∆SSIM ($\times 10^{-2}$). The remainder of this paper is structured as follows: Section 2 provides a detailed exposition of the proposed method. Section 3 presents the experimental results and analyzes the performance of the proposed method. Section 4 concludes the paper.

The Proposed Fast-MFQE
The architecture of the proposed Fast-MFQE is shown in Figure 3, where depthwise separable convolution (DSC) [34] is employed in place of standard convolution to decrease the computational complexity and increase the inference speed of the network. The primary objective of Fast-MFQE is to generate an enhanced video frame $\hat{O}_t$ that closely resembles the ground-truth frame, i.e., the original uncompressed video frame, in the pixel domain. To leverage temporal information from adjacent frames, Fast-MFQE takes the target frame $V_t$ and its neighboring frames $\{V_{t\pm n}\}_{n=1}^{N}$ as the input of the network. Fast-MFQE has three main modules, each of which is described in the following subsections.
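The parameter saving that motivates DSC can be illustrated with a short calculation. The sketch below counts the weights of a standard k × k convolution and of its depthwise separable counterpart; the function names and the 64-channel layer size are assumptions chosen for illustration and are not taken from the Fast-MFQE architecture.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k standard convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer size, chosen only for illustration.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)
ratio = dsc / std  # equals 1 / c_out + 1 / k**2
```

For this hypothetical layer the count falls from 36,864 to 4,672 weights, roughly 12.7% of the original.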

Image Pre-Processing Building Modules (IPPB)
Many compressed video enhancement approaches take multiple video frames as input in order to incorporate temporal information from different frames. However, these networks struggle to achieve fast inference on high-resolution video because the computational complexity grows substantially with resolution. Consequently, pre-processing of the input frames is needed to enable fast inference on high-resolution video.
To leverage the information from the adjacent frames $\{V_{t\pm n}\}_{n=1}^{N}$ to enhance the target frame $V_t$, Fast-MFQE uses both $\{V_{t\pm n}\}_{n=1}^{N}$ and $V_t$ as inputs to the network. To decrease the data volume of the input frames and increase the model's inference speed, Fast-MFQE introduces the Image Pre-Processing Building Modules (IPPB), inspired by [34,35]. As depicted in Figure 3, IPPB consists of two primary components: Mean Shift and Pixel Unshuffle.

Mean Shift
In general, video frames exhibit substantial spatial redundancy, and mitigating this redundancy can reduce the input data effectively. Zhang et al. [39] first introduced the Mean Shift operation to image super-resolution and presented the RCAN network, which achieved remarkable results. The Mean Shift operation normalizes the data by subtracting the statistical mean value from each image sample, which emphasizes the differences between individual samples. Drawing inspiration from [37,39], Fast-MFQE employs the Mean Shift operation to diminish redundant information in images, which speeds up model training and consequently reduces the inference time.
Let Fast-MFQE take the adjacent frames $\{V_{t\pm n}\}_{n=1}^{N}$ and the target frame $V_t$ as the network input (n = 3), let the Mean Shift operation be denoted as MS(·), and let $F_{MS}$ denote the feature obtained after the Mean Shift operation.
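One form consistent with these definitions, stated here as an assumption, is $F_{MS} = MS([V_{t-N}, \ldots, V_t, \ldots, V_{t+N}])$. The sketch below illustrates the operation on a single frame by subtracting a per-channel mean from every pixel and then adding it back. The function name, the nested-list layout, and the mean value 0.5 are hypothetical; which statistics Fast-MFQE actually uses (dataset-level or per-sample) is not specified here.

```python
from typing import List

Channel = List[List[float]]  # H x W grid of pixel values
Frame = List[Channel]        # C channels


def mean_shift(frame: Frame, channel_means: List[float], sign: int = -1) -> Frame:
    """Add sign * mean to every pixel of each channel (sign=-1 subtracts)."""
    return [
        [[pixel + sign * mean for pixel in row] for row in channel]
        for channel, mean in zip(frame, channel_means)
    ]


# Toy single-channel 2 x 2 frame; the mean 0.5 is a made-up statistic.
frame = [[[0.2, 0.4], [0.6, 0.8]]]
shifted = mean_shift(frame, [0.5])             # subtract the mean
restored = mean_shift(shifted, [0.5], sign=1)  # add it back
```

After the shift the pixel values are centered around zero, and applying the same shift with the opposite sign recovers the original frame.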

Pixel Unshuffle
While reducing redundancy in individual video frames through the Mean Shift operation is effective, downsampling the input data is necessary to further alleviate the computational burden on the network.
Inspired by FFDNet [40], Fast-MFQE employs a reversible downsample (R-Downsample) operation that divides each input frame into four sub-frames, reducing the spatial size of the data processed by the network. This operation decreases the model's computational cost and inference time. Because it only rearranges pixels and discards none of them, detailed information is preserved, which benefits model performance and generalization ability. The inverse operation of R-Downsample is denoted as R-Upsample.
Let the four sub-frames generated by the R-Downsample operation be denoted as $I_l$ (l = 1, 2, 3, 4), and let the R-Downsample operation be denoted as RD(·).
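With these definitions, the sub-frames can be read as $\{I_l\}_{l=1}^{4} = RD(F_{MS})$, assuming that R-Downsample is applied to the Mean Shift output. The following sketch shows a reversible 2 × 2 downsample on a single-channel grid together with its inverse. The function names and the ordering of the four sub-frames are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Grid = List[List[int]]


def r_downsample(frame: Grid) -> List[Grid]:
    """Split an H x W grid (H, W even) into four H/2 x W/2 sub-frames."""
    return [
        [row[dx::2] for row in frame[dy::2]]
        for dy in (0, 1)
        for dx in (0, 1)
    ]


def r_upsample(subs: List[Grid]) -> Grid:
    """Inverse of r_downsample: interleave the four sub-frames."""
    h, w = len(subs[0]), len(subs[0][0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for index, sub in enumerate(subs):
        dy, dx = divmod(index, 2)  # row and column offset of this sub-frame
        for y in range(h):
            for x in range(w):
                out[2 * y + dy][2 * x + dx] = sub[y][x]
    return out


# 4 x 4 toy frame whose pixel values are 0..15 in raster order.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
subs = r_downsample(frame)
```

Each sub-frame has one quarter of the pixels, and `r_upsample(subs)` returns the original grid exactly, which is what makes the operation reversible. In PyTorch, `torch.nn.PixelUnshuffle` and `torch.nn.PixelShuffle` provide the same kind of rearrangement on tensors, although their channel ordering may differ from this sketch.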

Spatio-Temporal Attention Fusion (STAF)
Spatio-temporal information of video frames is essential for quality enhancement. To improve the fusion of image information from different temporal and spatial contexts, Fast-MFQE introduces the Spatio-Temporal Attention Fusion (STAF) module. In this module, two 3 × 3 convolutions extract spatial information from the video frames. Temporal attention is then applied to capture the temporal characteristics of the frames, as illustrated in Figure 4. Next, the spatial and temporal information is concatenated along the channel dimension. Finally, the concatenated information is fused by a 1 × 1 convolutional layer. This process ensures effective integration of spatial and temporal information for compressed video enhancement. Let the spatial information extracted by the two 3 × 3 convolutions be denoted as $F_S$, the temporal information extracted by temporal attention be denoted as $F_T$, and the spatio-temporal feature fused by the 1 × 1 convolution be denoted as $F_{ST}$. In the formulation of these features, Conv3(·) and Conv1(·) are the mapping functions of the 3 × 3 convolution and the 1 × 1 convolution, respectively, [·, ·] represents the concatenation operation, and TA(·) is the mapping function of temporal attention.
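One reading of the three features that is consistent with the definitions above is sketched below. The symbol $X$ for the module input is introduced here only for illustration, and applying $\mathrm{TA}(\cdot)$ to $X$ rather than to $F_S$ is an assumption, since the description alone does not fix the input of the temporal attention branch.

```latex
\begin{align}
F_S    &= \mathrm{Conv3}\big(\mathrm{Conv3}(X)\big), \\
F_T    &= \mathrm{TA}(X), \\
F_{ST} &= \mathrm{Conv1}\big([F_S, F_T]\big).
\end{align}
```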

Feature Reconstruction Network (FRN)
To achieve precise reconstruction of video frames, the Feature Reconstruction Network (FRN) is introduced in Fast-MFQE. As shown in Figure 5, the FRN is built from dense residual blocks that combine difference learning and residual learning. Difference learning is dedicated to capturing high-frequency information in video frames by calculating element-wise differences between feature maps. Residual learning aims to learn diverse feature information by calculating element-wise sums of feature maps. By incorporating both difference and residual learning, relevant features are captured and integrated effectively, enabling the generation of refined video frames. Let $R_t$ denote the reconstructed information generated by the FRN, let $V_t$ represent the target frame, and let $R^{HQ}_t$ denote the reconstructed frame.
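A relation consistent with these definitions, assuming that the FRN output is added to the target frame as a global residual, would be:

```latex
\begin{equation}
R^{HQ}_t = V_t + R_t .
\end{equation}
```

Under this assumption the network only has to predict the correction $R_t$ to the compressed frame rather than the full frame.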

Loss Function
To encourage the enhanced frame $R^{HQ}_t$ to be as close as possible to the original uncompressed frame $V_{raw}$ in the pixel domain, Fast-MFQE adopts the mean square error between $R^{HQ}_t$ and $V_{raw}$ as the loss function of the model, where H, W, and C represent the height, width, and number of channels of the image under evaluation, respectively. The network parameters θ are learned by gradient descent [41] to minimize the loss in Equation (7). Thus, Fast-MFQE can be trained effectively to enhance the quality of compressed videos.
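A standard mean square error consistent with this description is $L(\theta) = \frac{1}{HWC}\lVert R^{HQ}_t - V_{raw}\rVert_2^2$; the exact normalization used by the authors is an assumption. The sketch below evaluates this quantity on nested lists indexed as [channel][row][column]. The function name and the toy values are hypothetical.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][column]


def mse_loss(enhanced: Image, raw: Image) -> float:
    """Mean of squared pixel differences over all C x H x W entries."""
    total, count = 0.0, 0
    for ch_e, ch_r in zip(enhanced, raw):
        for row_e, row_r in zip(ch_e, ch_r):
            for p_e, p_r in zip(row_e, row_r):
                total += (p_e - p_r) ** 2
                count += 1
    return total / count


# Toy 1 x 2 x 2 images: a single pixel differs by 2.
raw = [[[0.0, 1.0], [2.0, 3.0]]]
enhanced = [[[0.0, 1.0], [2.0, 5.0]]]
loss = mse_loss(enhanced, raw)  # squared error 4 spread over 4 pixels
```

In a PyTorch training loop the same quantity is available as `torch.nn.MSELoss` with its default mean reduction.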

Experiments
In this section, the effectiveness of the proposed Fast-MFQE method is demonstrated by extensive experiments. The experimental settings are introduced in Section 3.1, and the performance of Fast-MFQE is compared with that of state-of-the-art methods on the JCT-VC test sequences [2] in Section 3.2.

Datasets
The Fast-MFQE model is trained using the dataset introduced in MFQE2.0 [24], which is divided into two parts. First, the 18 sequences from the Joint Collaborative Team on Video Coding (JCT-VC) [2], which are commonly used for testing, form the test set. Second, the remaining 142 sequences are randomly split into non-overlapping training (106 sequences) and validation (36 sequences) sets. All 160 sequences are compressed using HM16.5 [1], an encoding platform for H.265, in Low-Delay configuration with QP set to 27, 32, and 37. The results demonstrate the excellent generalization ability of the proposed method across these QP values.

Quality Enhancement Assessment Metrics
Extensive research [13][42][43][44] has been conducted to develop efficient and accurate methods for assessing the quality of images and video frames. A quality enhancement evaluation metric measures the quality of a distorted image or video frame quantitatively by comparing it with the corresponding ground truth, i.e., it is a full-reference metric. In this experiment, two widely used evaluation metrics, PSNR and SSIM [42], are employed.
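As a concrete illustration of a full-reference metric, the sketch below computes PSNR between a ground-truth plane and a test plane, using the usual definition $10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$ for 8-bit samples. The function name and the toy pixel values are assumptions made for illustration; SSIM is omitted because it requires local windowed statistics.

```python
import math
from typing import List

Plane = List[List[int]]  # H x W grid of 8-bit samples


def psnr(reference: Plane, test: Plane, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized planes."""
    squared_error = sum(
        (r - t) ** 2
        for row_r, row_t in zip(reference, test)
        for r, t in zip(row_r, row_t)
    )
    mse = squared_error / (len(reference) * len(reference[0]))
    if mse == 0:
        return math.inf  # identical planes
    return 10.0 * math.log10(peak ** 2 / mse)


# Made-up 2 x 2 planes: the "enhanced" plane halves every pixel error.
ground_truth = [[100, 100], [100, 100]]
compressed = [[90, 110], [90, 110]]
enhanced = [[95, 105], [95, 105]]
delta_psnr = psnr(ground_truth, enhanced) - psnr(ground_truth, compressed)
```

Halving every pixel error divides the MSE by four, so `delta_psnr` equals $10 \log_{10} 4$, about 6.02 dB, in this toy case.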

Parameter Settings
The basic settings and hyperparameters of the experiments are presented here. The Fast-MFQE model is trained using the PyTorch framework. Specifically, Fast-MFQE takes three frames (n = 3) as input to the network. The number of iterations is $3 \times 10^5$ and the mini-batch size is 32. Training patches are cropped to 128 × 128. The learning rate is set to $1 \times 10^{-4}$ and halved every $1 \times 10^5$ iterations. Note that the above hyperparameters are tuned over the training set. Finally, the parameters of the Fast-MFQE network are updated using the Adam algorithm [41] until the network converges.
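The learning-rate schedule described above amounts to a step decay. A minimal sketch follows, assuming that the halving takes effect exactly at iterations 100,000 and 200,000; the function name and its parameter names are hypothetical.

```python
def learning_rate(iteration: int, base_lr: float = 1e-4,
                  step: int = 100_000, factor: float = 0.5) -> float:
    """Step decay: multiply base_lr by factor once per completed step."""
    return base_lr * factor ** (iteration // step)


# Rates at the start of training and after each scheduled halving.
schedule = [learning_rate(i) for i in (0, 100_000, 200_000)]
```

In PyTorch, the same behavior can be obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=100000` and `gamma=0.5`, stepped once per iteration.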

Quantitative Comparison
In this section, the performance of the Fast-MFQE model is evaluated quantitatively using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), which are widely used objective quality metrics. The Fast-MFQE model is compared with AR-CNN [12], DnCNN [13], RNAN [17], MFQE1.0 [23], and MFQE2.0 [24]. Among these methods, AR-CNN [12], DnCNN [13], and RNAN [17] enhance the quality of compressed images, MFQE1.0 [23] is the first method for multi-frame compressed video enhancement, and MFQE2.0 [24] is the most advanced method for enhancing the quality of compressed videos. To ensure a fair comparison, all of these methods are trained and tested on the same dataset. Table 1 reports the ∆PSNR and ∆SSIM results, which are calculated between enhanced and compressed frames and averaged over each test sequence. Note that ∆PSNR > 0 and ∆SSIM > 0 indicate an improvement in the objective quality of the compressed video. Specifically, compared to the most advanced compressed video enhancement method, MFQE2.0 [24], Fast-MFQE achieves an average increase of 19.6% in PSNR and 8.2% in SSIM at QP = 37, and an average increase of 12.2% in PSNR and 14.2% in SSIM at QP = 27. Although the enhancement effect of Fast-MFQE is close to that of MFQE2.0 [24], the inference speed of Fast-MFQE is faster. Overall, Fast-MFQE outperforms all compared methods in terms of objective quality enhancement.
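The sketch below illustrates how a ∆ value averaged over test sequences, and a relative gain over a baseline method, can be computed. All numbers are made up for illustration and are not taken from Table 1, and treating the reported percentages as relative gains in the averaged ∆ values is an assumption about how they were derived.

```python
from statistics import mean
from typing import Sequence


def average_delta(enhanced: Sequence[float], compressed: Sequence[float]) -> float:
    """Mean per-sequence gain of a metric (enhanced minus compressed)."""
    return mean(e - c for e, c in zip(enhanced, compressed))


def relative_gain_percent(ours: float, baseline: float) -> float:
    """Relative improvement of one average gain over another, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Made-up PSNR values (dB) for two test sequences.
compressed_psnr = [32.0, 29.0]
enhanced_psnr = [32.5, 30.0]
delta_psnr = average_delta(enhanced_psnr, compressed_psnr)  # mean of 0.5 and 1.0
gain = relative_gain_percent(delta_psnr, 0.6)  # versus a hypothetical 0.6 dB baseline
```

With these hypothetical values, an average gain of 0.75 dB against a baseline gain of 0.6 dB corresponds to a relative improvement of about 25%.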

Subjective Comparison
In this section, the subjective quality of frames enhanced by Fast-MFQE is evaluated. Figure 6 shows enhanced video frames from BasketballDrill at QP = 37, BlowingBubbles at QP = 32, and Cactus and Traffic at QP = 42. It can be observed that frames enhanced by the proposed Fast-MFQE method have sharper edges and more vivid details than those of other methods. For example, the basketball in BasketballDrill, the face in BlowingBubbles, the words in Cactus, and the car in Traffic are restored with fine textures by Fast-MFQE, similar to MFQE2.0 [24].

Comparison of Inference Performance
In this section, the inference capability and the degree to which the Fast-MFQE model is lightweight are quantitatively evaluated based on the frame rate and the number of parameters (Param). The Fast-MFQE model is compared with DnCNN [13], RNAN [17], MFQE1.0 [23], MFQE2.0 [24], and STDF [26]. It should be noted that STDF [26] is currently the most lightweight model used for compressed video enhancement. For the sake of fairness, all models are tested on the same configuration: an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30 GHz CPU, an Nvidia GeForce GTX 1080Ti GPU, and Ubuntu 20.04. Table 2 reports the number of parameters (Param) and the frame rate of the different models. Fast-MFQE maintains a speed of over 25 frames per second across video resolutions. Notably, even when processing 1080p videos, Fast-MFQE achieves a speed close to 25 frames per second, ensuring smooth, non-stuttering video processing. Additionally, Fast-MFQE reduces the number of parameters by 33.4% compared to the current lightest model, STDF [26]. Overall, Fast-MFQE outperforms the compared methods in terms of inference speed and number of parameters.
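A frame rate of 25 frames per second leaves a budget of 40 ms per frame. The sketch below shows one simple way to measure throughput for an arbitrary per-frame function. The names `measure_fps` and `frame_budget_ms` are hypothetical, and the identity function stands in for a real model; this is not the benchmarking procedure used for Table 2.

```python
import time
from typing import Callable, Iterable


def frame_budget_ms(target_fps: float) -> float:
    """Maximum time per frame, in milliseconds, to sustain target_fps."""
    return 1000.0 / target_fps


def measure_fps(enhance: Callable[[object], object],
                frames: Iterable[object]) -> float:
    """Average frames per second of enhance() over the given frames."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        enhance(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed


budget = frame_budget_ms(25.0)  # per-frame budget at 25 frames per second
fps = measure_fps(lambda frame: frame, range(1000))  # placeholder "model"
```

When timing a GPU model in PyTorch, `torch.cuda.synchronize()` should be called before reading the clock, because CUDA kernels are launched asynchronously.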

Ablation Studies
As shown in Table 3, the ablation test was performed at QP = 37. Although Model2 and Model3 significantly improve the inference speed, their enhancement effects, measured by ∆PSNR and ∆SSIM, decline sharply. Therefore, to better balance the enhancement effect and the inference speed, we choose Model1; that is, the three modules IPPB, STFA, and FRN are all necessary.

Perceptual Quality Comparison
In this section, the performance of the Fast-MFQE is evaluated quantitatively using the learned perceptual image patch similarity (LPIPS) [45] and perceptual index (PI) [46], which are widely used perceptual quality assessment metrics. Table 4 reports the ∆LPIPS and ∆PI results, which are calculated between enhanced and compressed frames averaged over each test sequence. Note that ∆LPIPS < 0 and ∆PI < 0 indicate improvement in perceptual quality. As shown in this table, the Fast-MFQE is significantly superior to all the compared methods in terms of perceptual quality enhancement.

Subjective Quality and Inference Speed at Different Resolutions
This section examines the inference speed and enhancement quality of the proposed Fast-MFQE on videos of different resolutions, all tested at QP = 37. The results are shown in Figures 7-11 and demonstrate that Fast-MFQE is capable of fast inference with high quality across different resolutions.

Conclusions
This paper presents a fast multi-frame quality enhancement approach, named Fast-MFQE, which facilitates efficient model inference. The Fast-MFQE is the first lightweight model in the field of compressed video enhancement. Extensive experiments demonstrate that Fast-MFQE outperforms previous methods in terms of parameter count, inference speed, and quality enhancement performance on benchmark datasets. These attributes make it well suited to real-time applications such as video streaming, video conferencing, and video surveillance.
Author Contributions: Conceptualization, K.C. and J.C.; writing-original draft preparation, K.C.; writing-review and editing, J.C., H.Z. and X.S. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments:
We sincerely appreciate the anonymous reviewers' critical comments and valuable suggestions for improving the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.
